
fix: CPU OOM issue during LoRA training #41

Merged

xiami2019 merged 1 commit into OpenMOSS:main from SongwuJob:main on Mar 4, 2026
Conversation

@SongwuJob (Contributor)

When fine-tuning with the provided LoRA training script, CPU memory usage grows continuously over time until the process is eventually killed by the system due to out-of-memory (OOM).

The issue is caused by enabling torch.cuda.memory._record_memory_history(enabled="all"), which records CUDA memory allocation events and stores them in host (CPU) memory. As training progresses, the accumulated memory history grows without bound, leading to excessive CPU memory consumption and ultimately the CPU OOM.
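A minimal sketch of the kind of fix described above (the function name and structure are illustrative assumptions, not the actual patch): leave memory-history recording off during normal training, and if it is needed for debugging, cap the retained event count via `max_entries` instead of using an unbounded buffer.

```python
# Sketch (assumption: illustrative helper, not the actual PR diff).
# Disables CUDA memory-history recording by default so the host-side
# event buffer cannot grow for the duration of a long LoRA run.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # allow the sketch to run without PyTorch installed
    HAVE_CUDA = False


def configure_memory_history(debug: bool = False, max_entries: int = 100_000) -> str:
    """Enable bounded recording only when explicitly debugging."""
    if not HAVE_CUDA:
        return "no-cuda"
    if debug:
        # If profiling is needed, bound the buffer rather than
        # recording with enabled="all" and the default unbounded size.
        torch.cuda.memory._record_memory_history(
            enabled="all", max_entries=max_entries
        )
        return "bounded"
    # The fix for training runs: passing enabled=None turns recording off.
    torch.cuda.memory._record_memory_history(enabled=None)
    return "disabled"
```

The key point is that `_record_memory_history` accumulates its trace in CPU memory for as long as it stays enabled, so any long-running training loop should either disable it or bound it.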

@xiami2019 merged commit ee050e4 into OpenMOSS:main on Mar 4, 2026
